2024-12-24 de novo prelim

Introduction

3 days ago i was looking through the gtdbtk manual and saw that de_novo_wf was an option for analysis to create the trees, from the description given:

knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")

i beleived this would be something i should do as it might produce more accurate trees. sample 1Dt2d Enterobacter cancerogenus had been placed by the classify_wf in the previous gtdbtk analysis in the genus Pantoea, which lead me to this search. after a bit of trial and error, i produced this script

Methods

This ran as a slurm job on hawk (SCW) from rougly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that i experimented with were

- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10

I settled on these as being the “best”, however, it is entirely possible that they could be more optimised.

Results

This analysis produced these files:

/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log

./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv

./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv

./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results

./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree

I then moved this gtdbtk.bac120.decorated.tree file into Dendroscope for review, all 10 are on one tree, but 1Dt2d is still being placed in the “wrong” genus. on review of its sister accession on the ncbi database.

Conclusion

On the NCBI page for the sister accession, can be found a CheckM analysis that comes back with

completeness: 90%
contamination: 3.6%
Taxonomy check status: failed

Upon viewing the tree in Dendroscope, the joining node has the label 0.968. This I believe to be the probability the relationship is correct. this implies they are the same species, and the online sample is also identified as Enterobacter cancerogenus. However, due to the checkm analysis i find it plausible that they both have been misidentified and are in reality Pantoea species, i find this the most parsimonious explanation. I will follow this up with a CheckM analysis of my own on 1Dt2d

dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis
dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis

This was a “technical spike” or proof of concept for de_novo_wf

📌 TODO: do another Checkm analysis on 1Dt2d to see if the values are similar to the online sample

2024-12-25 🎄 beginning of table, checkm

Introduction

i wanted to see if the outputs of checkm differed from checkm2 so ran that on hawk. I also began recreating the innital table for metadata about the bangor samples, in the spirit of automation, a less manual approach was chosen this time around.

Methods

using this script i ran a slurm job on hawk under the lineage_wf of CheckM for all 10 Bangor-made samples, this took just 4 minutes. I also worked on exporting the data i want to tabulate off of hawk. The past way i did this was by manually entering each file and noting down the important characteristics. However, because there are going to be more samples(and i wanted to be clever) i decided to use a more automated process. This was done by identifying different documents in the flye directories on hawk, specifically files called “assembly_info.txt” which contain the same information, but are vastly more exportable. These are stored here: cd /scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/ using this script i exported them off of hawk.

Results

the CheckM analysis produced this output directory. My export script exported all 10 “assembly_info.txt” files to a directory in my home directory, as well as adding their accession ID to the name, this is important as otherwise i wouldnt know which belonged to what accession. I then brought them down and stored them here.

Conclusions

with it being christmas i did not take a serious look at the significance of the outputs of either, so that is what i plan to do next so that i can have some conclusions, maybe by the end of tomorrow. In conclusion, this process is only half done and will continue into the following entry(s).

📌 TODO: Complete analysis / processing of checkm and table stuff

2024-12-27 finishing the table and checkm stuff, maybe extra if there is time

Introduction

Basic stuff, finding out what i can do to the checkm analysis to turn it into publishable data, making the table using R scripts and the output files. exams in January are rapidly approaching, so i really do need to pick up the pace if i want to get this done before i begin revising. there is not a real goal for this exact piece, i just want to know how congruent the outputs from CheckM are to CheckM2.

Methods

Using the CheckM documentation. I found the output that i needed was bin_stats_ext.tsv. this is where the contamination(and a lot of other fun looking stats) are held. I can thusly import this file to R to try and extract usefull stuff / tabulate. Over the course of the day i developed and improved CheckMtablegenerator.R. which creates the table layed out in results below. This uses a few packages, one of which was new to me, gt. This is a very useful package for table generation.

Results

Table 1 - subset of data obtained from CheckM analysis
of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality.
Accession Marker lineage Completeness Contamination GC GC std Genome size contigs Longest contig Coding density Predicted genes
1Dt1h o__Actinomycetales 100.00 0.00 65.519% 4.284% 4,336,578 4 2,051,556 90.334% 3,818
1Dt2d f__Enterobacteriaceae 100.00 0.90 53.889% 1.270% 5,345,294 6 2,288,975 87.931% 4,937
1Dt100h o__Actinomycetales 98.84 0.19 52.446% 0.564% 2,379,473 2 2,349,045 89.586% 2,071
2Dt1l o__Actinomycetales 98.82 2.32 67.742% 0.844% 3,996,506 16 1,089,026 93.707% 3,915
3Dt1c o__Sphingomonadales 97.23 1.19 66.487% 1.385% 3,636,266 86 275,317 90.660% 3,433
3Dt2j o__Sphingomonadales 96.19 1.21 65.644% 1.752% 2,813,977 75 195,248 92.779% 2,786
3Dt2c o__Sphingomonadales 95.75 0.85 66.093% 1.604% 3,819,610 84 247,364 89.769% 3,516
3Dt2h o__Actinomycetales 87.98 0.51 70.327% 1.317% 3,006,227 166 125,244 90.324% 2,978
1Dt100g o__Sphingomonadales 76.77 0.00 68.105% 1.332% 2,104,071 153 74,994 92.645% 2,179
2Dt1e o__Actinomycetales 44.74 0.00 71.162% 1.565% 2,332,403 149 70,689 89.402% 2,215
Source: CheckM lineage_wf performed on 2024-12-25

As mentioned, this table is a subset of the data, there was more outputted by CheckM, some of which overlaps with other analyses, but i did not know if it was relevent to include, so left it out, however, creating a new one with the omitted fields in would not be hard. This table will pair with one produced from CheckM2, because from what i’ve seen one is kind of an improvement on the other, but they kind of do different things, its confusing and thus why i want to do the comparison for future reference. I did not know what counted as “significantly complete”, so arbitrarily placed the value at 90%. This analysis shows that 7 of the 10 samples are complete or near enough, with only 3 falling below my line. It should be noted that 2Dt1e, the lowest sample is only found to be 45% complete, this is a large outlier from the rest of the range. Contamination is fairly low in all, none above 3% and 3 samples on 0.00%.

Conclusion

The main purpose of doing this is to compare with Checkm2, so im not sure what conclusions i can get from just this, other than many of the samples look sufficiently complete and i learned a fair amount from this experience. I think this will act as a good first step to compare future analysis against, for confirmation, as many upcoming analyses will give repeat data in some areas. Some of the field, for example “GC” are odd and i dont quite understand their significance, perhaps Aaron can give more info when he gets to this part of the notebook.

📌 TODO: finish the straight flye analysis / start on CheckM2

2024-12-28 flye output tabulation and CheckM2 complete pipeline

Introduction

Today i wanted to finish off the work from yesterday and start on CheckM2. this should produce two tables that i can use to compare with the first one from Checkm, maybe find out which method is superior?

Methods

The order i did the two main focuses in shifted greatly during the day, with me bouncing between flye and CheckM2 analyses, so this section will have subheaders for better readability

Flye analysis 1: assembly_info.txt

I created an R script, “rawflyeanalysis.R”. Basic stuff, imports the text files i chose to download(i wonder if i could download the flye.log files and use a sub() to cut out the non-useful stuff). This then sorts and processes the data into the table seen below using a few packages, Table 2: Assembly_info.txt.

CheckM2: slurm job

Just before lunch(11:45-ish) i set off the slurm job to run CheckM2 on hawk. Let’s see how that goes after lunch, one thing of note though, i did make a minor modification, in the original file, i was running it on the fastas one at a time(?) weird, anyway, i changed it to work on the directory containing all 10, so it should do them all at once. run_checkm2TN.sh.

Flye analysis 2: flye.log

After lunch, firstly i went back to the flye.log analysis i wanted to do largely for posterity. This was horrifyingly complex due to the “complex” nature of the flye.log files. I did this by creating a modified copy of my assembly_info.txt file grabber shell script, called flye.loggetter.sh, using ctrl + h to find and replace instead of manually changing things. I added this to an existing output directory in my home section of hawk, which i exported using MobaXterm(do i need to include a section on how i use MobaXterm?) to my computer. This is the folder called flyestuff, which contains both the flye.log and the assembly_info.txt files for each accession, i grouped them as it seemed relevent for organisation and didnt hinder the coding process at all. The R file i used to tabulate this is flyeloganalysis.R. This produced Table 3: Flye.log

CheckM2: R table generation

This started with doing cd /scratch/scw2160/02_outputs/flye_asm/CheckM2_20241228 to see the outputs, the job took roughly half an hour, ending at 12:15 pm. Firstly, i copied the output directory to my home area, then once again used MobaXterm to bring it down to the git repo, here. Then, i began work on the R file to turn the quality_report.tsv file. This file was formatted very well, so turning it into a table was very simple, done in CheckM2tablegenerator.R. This produced Table 4: CheckM2

Note: there were other outputs to CheckM2, they were very interesting, im going to make the next section about exploring that possiblity, but they dont relate to this analysis.

Results

Table 2: Assembly_info.txt

Table 2 - Data obtained from flye outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Number of contigs Total length Largest fragment Mean fragment length
3Dt2h 166 3,006,227 125,244 18,109.80
1Dt100g 153 2,104,071 74,994 13,752.10
2Dt1e 149 2,332,403 70,689 15,653.71
3Dt1c 86 3,636,266 275,317 42,282.16
3Dt2c 84 3,819,610 247,364 45,471.55
3Dt2j 75 2,813,977 195,248 37,519.69
2Dt1l 16 3,996,506 1,089,026 249,781.62
1Dt2d 6 5,345,294 2,288,975 890,882.33
1Dt1h 4 4,336,578 2,051,556 1,084,144.50
1Dt100h 2 2,379,473 2,349,045 1,189,736.50
Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024
coverage could not be calculated from this method as no result matched what was found in the flye.log files

The basic flye output analysis produced this table, as mentioned, i couldn’t get coverage to match what was on the flye.log file that i got the original of this table from, it is possible that file is the one that is wrong, but genome coverage is kind of all over the place, so i just omitted it as i don’t think its one of the mandatory fields. I chose to sort the table by “Number of contigs” rather arbitrarily, but it does show something interesting in that why is there such a large range in fragment number?

Table 3: Flye.log

Table 3 - Data obtained from flye.log outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Fragments Total length Fragments N50 Largest fragment Scaffolds Mean coverage
3Dt2h 166 3,006,227 26,385 125,244 0 390
1Dt100g 153 2,104,071 18,022 74,994 0 318
2Dt1e 149 2,332,403 22,011 70,689 0 344
3Dt1c 86 3,636,266 63,365 275,317 0 320
3Dt2c 84 3,819,610 69,733 247,364 0 190
3Dt2j 75 2,813,977 71,970 195,248 0 350
2Dt1l 16 3,996,506 345,773 1,089,026 0 289
1Dt2d 6 5,345,294 1,019,507 2,288,975 0 225
1Dt1h 4 4,336,578 1,857,409 2,051,556 0 404
1Dt100h 2 2,379,473 2,349,045 2,349,045 0 224
Source: flye.log files produced in flye_asm analysis, August 27th 2024

This is a table of the type of stuff i first collated in the Summer, it shares many similarities and cross overs with Table 2, neither have quality statistics, but both are metadata largely concerning the length and fragment, a marked difference is this one comes with a concrete value of coverage. Comparative analysis of them feels like a thing that would happen in the conclusion to this document.

Table 4: CheckM2

Table 4 - quality analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession Completeness Contamination Completeness model used Translation table used Additional notes
1Dt1h 100.00 0.31 Neural Network (Specific Model) 11 None
1Dt2d 100.00 0.87 Neural Network (Specific Model) 11 None
2Dt1l 100.00 0.27 Neural Network (Specific Model) 11 None
3Dt2c 99.94 0.39 Neural Network (Specific Model) 11 None
3Dt2j 99.80 0.00 Neural Network (Specific Model) 11 None
1Dt100h 99.12 0.00 Gradient Boost (General Model) 11 None
3Dt1c 98.34 0.86 Neural Network (Specific Model) 11 None
3Dt2h 86.04 0.10 Neural Network (Specific Model) 11 None
1Dt100g 78.42 0.33 Neural Network (Specific Model) 11 None
2Dt1e 48.29 0.08 Neural Network (Specific Model) 11 None
Source: CheckM2 predict performed on 2024-12-28

This is the table of CheckM2 analysis, there are many differences in which fields are produced, also, there appears to be differences in the values themselves for Completeness and Contamination, perhaps part of the comparitive analysis, done in the conclusion should be the date at which both were at, as older versions will probably have less accurate values, and this may cause the difference as genomics is a fast evolving field. A Noteable difference is that CheckM used an internal diamond database, whereas CheckM2 required me to download my own. Exact comparison will be conducted in the conclusion.

Conclusion

Thus far, i think that i got all of the fields in Table 2 in a more reliable format in Table 1, but with the quality checks as well, both tables suffer weirdness with “cov.” and “GC” are they the same? are they both coverage? it is as mysterious as it is annoying. Let’s see how they match up to CheckM2. It is getting late, i think that tomorrow is going to be the comparitive analysis and then moving onto the fun-looking checkm2 stuff.

📌 ?: CONCLUSION: how do comparison of all 4 tables next to each other without creating a monstrocity of indented code / something about pills/ find date difference in version history between CheckM and CheckM2 comparitive analysis/

2024-12-29 conclusion to tables + prelim investigations into funky checkm stuff

Introduction

Due to this section being largely about the conclusion from the last week, the other sections may be light, as this is largely just a wrap up i was too tired to do yesterday.

Methods

One thing that might be relevent to note is that i changed the Markdown file so that the tables generated above now come from .csv output files. This helps on cutting down the time taken to knit the document and means that there are not multiple instances of code across the document, as the results section down here will include all 4 tables in one form or another. (basically, this increases flexibility of how the data can be used). In order to find the version history for CheckM, i went here ______. For CheckM2, i went here ______ and looked at the comit history as the version on Hawk is a developer version, not one of the official release, beginning at ____.

Results

Comparison of CheckM and CheckM2

Table 5 - Comparison of quality metrics
for the same fasta files between CheckM/1.1.3 and CheckM2/0.1.3
Accession
CheckM/1.1.3
CheckM2/0.1.3
Completeness Contamination Completeness Contamination
1Dt100g 76.77 0.00 78.42 0.33
1Dt100h 98.84 0.19 99.12 0.00
1Dt1h 100.00 0.00 100.00 0.31
1Dt2d 100.00 0.90 100.00 0.87
2Dt1e 44.74 0.00 48.29 0.08
2Dt1l 98.82 2.32 100.00 0.27
3Dt1c 97.23 1.19 98.34 0.86
3Dt2c 95.75 0.85 99.94 0.39
3Dt2h 87.98 0.51 86.04 0.10
3Dt2j 96.19 1.21 99.80 0.00
Source:
CheckM lineage_wf performed on 2024-12-25
CheckM2 predict performed on 2024-12-28

📌 ?: AFTER LUNCH: pill stuff for comparison of programs directly, date the programs, do the comparison for the “raw”/ no package analyses. / look in log files for diamond version for M and M2

All tables

Flye assembly

Table 2 - Data obtained from flye outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Number of contigs Total length Largest fragment Mean fragment length
3Dt2h 166 3,006,227 125,244 18,109.80
1Dt100g 153 2,104,071 74,994 13,752.10
2Dt1e 149 2,332,403 70,689 15,653.71
3Dt1c 86 3,636,266 275,317 42,282.16
3Dt2c 84 3,819,610 247,364 45,471.55
3Dt2j 75 2,813,977 195,248 37,519.69
2Dt1l 16 3,996,506 1,089,026 249,781.62
1Dt2d 6 5,345,294 2,288,975 890,882.33
1Dt1h 4 4,336,578 2,051,556 1,084,144.50
1Dt100h 2 2,379,473 2,349,045 1,189,736.50
Source: assembly_info.txt files produced in flye_asm analysis, August 27th 2024
coverage could not be calculated from this method as no result matched what was found in the flye.log files

Flye log

Table 3 - Data obtained from flye.log outputs
of samples taken from the skin microbiome of Dendrobates tinctorius
Accession Fragments Total length Fragments N50 Largest fragment Scaffolds Mean coverage
3Dt2h 166 3,006,227 26,385 125,244 0 390
1Dt100g 153 2,104,071 18,022 74,994 0 318
2Dt1e 149 2,332,403 22,011 70,689 0 344
3Dt1c 86 3,636,266 63,365 275,317 0 320
3Dt2c 84 3,819,610 69,733 247,364 0 190
3Dt2j 75 2,813,977 71,970 195,248 0 350
2Dt1l 16 3,996,506 345,773 1,089,026 0 289
1Dt2d 6 5,345,294 1,019,507 2,288,975 0 225
1Dt1h 4 4,336,578 1,857,409 2,051,556 0 404
1Dt100h 2 2,379,473 2,349,045 2,349,045 0 224
Source: flye.log files produced in flye_asm analysis, August 27th 2024

CheckM

Table 1 - subset of data obtained from CheckM analysis
of samples taken from the skin microbiome of Dendrobates tinctorius, including metrics of quality.
Accession Marker lineage Completeness Contamination GC GC std Genome size contigs Longest contig Coding density Predicted genes
1Dt1h o__Actinomycetales 100.00 0.00 65.519% 4.284% 4,336,578 4 2,051,556 90.334% 3,818
1Dt2d f__Enterobacteriaceae 100.00 0.90 53.889% 1.270% 5,345,294 6 2,288,975 87.931% 4,937
1Dt100h o__Actinomycetales 98.84 0.19 52.446% 0.564% 2,379,473 2 2,349,045 89.586% 2,071
2Dt1l o__Actinomycetales 98.82 2.32 67.742% 0.844% 3,996,506 16 1,089,026 93.707% 3,915
3Dt1c o__Sphingomonadales 97.23 1.19 66.487% 1.385% 3,636,266 86 275,317 90.660% 3,433
3Dt2j o__Sphingomonadales 96.19 1.21 65.644% 1.752% 2,813,977 75 195,248 92.779% 2,786
3Dt2c o__Sphingomonadales 95.75 0.85 66.093% 1.604% 3,819,610 84 247,364 89.769% 3,516
3Dt2h o__Actinomycetales 87.98 0.51 70.327% 1.317% 3,006,227 166 125,244 90.324% 2,978
1Dt100g o__Sphingomonadales 76.77 0.00 68.105% 1.332% 2,104,071 153 74,994 92.645% 2,179
2Dt1e o__Actinomycetales 44.74 0.00 71.162% 1.565% 2,332,403 149 70,689 89.402% 2,215
Source: CheckM lineage_wf performed on 2024-12-25

CheckM2

Table 4 - quality analysis from CheckM2
based on samples taken from the skin microbiome of Dendrobates tinctorius
Accession Completeness Contamination Completeness model used Translation table used Additional notes
1Dt1h 100.00 0.31 Neural Network (Specific Model) 11 None
1Dt2d 100.00 0.87 Neural Network (Specific Model) 11 None
2Dt1l 100.00 0.27 Neural Network (Specific Model) 11 None
3Dt2c 99.94 0.39 Neural Network (Specific Model) 11 None
3Dt2j 99.80 0.00 Neural Network (Specific Model) 11 None
1Dt100h 99.12 0.00 Gradient Boost (General Model) 11 None
3Dt1c 98.34 0.86 Neural Network (Specific Model) 11 None
3Dt2h 86.04 0.10 Neural Network (Specific Model) 11 None
1Dt100g 78.42 0.33 Neural Network (Specific Model) 11 None
2Dt1e 48.29 0.08 Neural Network (Specific Model) 11 None
Source: CheckM2 predict performed on 2024-12-28

Conclusion

For the CheckM vs CheckM2 analysis, it should be noted that ____ is older, being released at ____, whereas the other was released at ____. However, both versions found on hawk are out of date from the modern ________. It would appear that CheckM2 is not a replacement for CheckM, as both are being updated as of 2023. CheckM uses its own version of the database, so i so far dont have a date for it_____. CheckM2 requires me to download my own diamond database beforehand, this was done around the 13th of August 2024, however, the version of the database is from the 23rd of March 2021.

📌 TODO: (after conc of last bit) investigations into KO and protein stuff / just found out checkm is better documented and has some graph output options, so that is something to look into as well